A worked basic example of a survival analysis in R
Author
Anshu Uppal
Published
May 16, 2025
Background
I used the rotterdam dataset from the survival package, which includes 2982 primary breast cancers patients whose records were included in the Rotterdam tumor bank.
This dataset records two events: disease relapse and death. Follow-up time is provided for each, with the origin being the time of the primary surgery.
In the article from which this data stems, the authors restricted their dataset to the 1546 patients who had node-positive disease (“Since the validation dataset comprised only node-positive patients and nodal status is an important prognostic factor, we omitted the node-negative patients to create the derivation dataset”). I decided to also restrict the dataset to only node-positive patients.
See:
Royston, P., Altman, D.G. External validation of a Cox prognostic model: principles and methods. BMC Med Res Methodol 13, 33 (2013). https://doi.org/10.1186/1471-2288-13-33
Prognostic factors
Quoted paragraph from the article:
“Candidate prognostic variables in the breast cancer datasets were age at primary surgery (age, years), menopausal status (meno, 0 = premenopausal, 1 = postmenopausal), tumour size (size), tumour grade (grade), number of positive lymph nodes (nodes), progesterone receptors (pgr, fmol/l), oestrogen receptors (er, fmol/l), hormonal treatment (hormon, 0 = no, 1 = yes), and chemotherapy (chemo). Tumour size (mm) was not available as a continuous variable in the Rotterdam dataset, therefore a standard coding was used; the base category was ≤ 20 mm and two dummy variables were used, namely 20 to 50 mm (sized1) and > 50 mm (sized2). We excluded grade, since it was measured according to a different protocol in the two datasets, and chemo, since all patients in the validation dataset received chemotherapy.”
Code
# install.packages("pacman")pacman::p_load( here, tidyverse, survival, survminer, ggsurvfit, survRM2, # calculate RMST with confidence intervals janitor, DT, flextable, patchwork # easily combine plots)# Custom package with useful function for summarising dataframe# pak::pak("UEP-HUG/UEPtools")# Load in the rotterdam dataset from the survival packagerotterdam <- survival::rotterdam |># filter out node-negative patientsfilter(nodes >0) # as "nodal status is an important prognostic factor, we omitted the node-negative patients to create the derivation dataset"# Add in the variable descriptionrotterdam_variables <-tribble(~name, ~value,"pid", "patient identifier", "year", "year of surgery", "age", "age at surgery", "meno", "menopausal status (0= premenopausal, 1= postmenopausal)", "size", "tumor size, a factor with levels <=20, 20-50, >50", "grade", "differentiation grade", "nodes", "number of positive lymph nodes", "pgr", "progesterone receptors (fmol/l)", "er", "estrogen receptors (fmol/l)", "hormon", "hormonal treatment (0=no, 1=yes)", "chemo", "chemotherapy", "rtime", "days to relapse or last follow-up", "recur", "0= no relapse, 1= relapse", "dtime", "days to death or last follow-up", "death", "0= alive, 1= dead")# Define a function for generating a clean table from the KM estimatesKM_table <-function(x){ x |># the survfit2() object ggsurvfit::tidy_survfit(times =seq(1000,7000,1000)) |> dplyr::relocate(std.error, .after = conf.low) |> dplyr::mutate(across(c(estimate:conf.low), ~round(.x,2))) |> dplyr::mutate(std.error =round(std.error, 4)) |> dplyr::select(-c(estimate_type:conf.level)) |> flextable::flextable() |> flextable::fontsize(size =9, part ="all") |>autofit() |> flextable::theme_zebra()}# Define a function for generating the KM curvesKM_plot <-function(x, strataname =NULL){ x |>ggsurvfit()+add_censor_mark(size =1.5, alpha =0.8) +add_confidence_interval() +add_risktable(risktable_stats =c("n.risk", "cum.censor", "cum.event")) +add_quantile(y_value =0.5, color ="gray50", linewidth =0.5) +scale_ggsurvfit()+add_pvalue()+labs(title ="Relapse-free survival after surgery",color = strataname,fill = strataname )}
Code
/* set DT table fontsizes */th { font-size:11px; } /* header font */td { font-size:11px; } /* cell font */
Inspect Rotterdam data
I went through the package’s documentation for the dataset to also include the descriptions of the variables, and used a custom function to summarise the dataset:
Code
print(paste0("Number of patients analyzed: ", nrow(rotterdam)))
[1] "Number of patients analyzed: 1546"
Code
rotterdam |> UEPtools::metadata_generator_any(variable_names = rotterdam_variables) |>select(-Num_Values) |> DT::datatable(filter ="top",options =list(pageLength =26 ),rownames =FALSE, # set to FALSE for cleaner lookclass ='cell-border stripe hover nowrap compact' )
Process dataset for downstream analysis
Create new outcome variable rd, where 0= alive without relapse, 1= relapse or death.
Also create rd_time variable where I calculate days to first of relapse, death or last follow-up
Code
rotterdam_cleaned <- rotterdam |>mutate(# Convert several variables to factormeno =fct_recode(factor(meno), premenopausal ="0", postmenopausal ="1"),grade =factor(grade),hormon =fct_recode(factor(hormon), No ="0", Yes ="1"),chemo =fct_recode(factor(chemo), No ="0", Yes ="1"),# Add categorical variable for number of nodesnodes_cat =factor(case_when( nodes >0& nodes <4~"1-3", nodes >=4& nodes <9~"4-9", nodes >=9~"9+" )),# Add categorical variable for number of estrogen receptorser_cat =factor(case_when( er <41~"0-40", er >=41& er <301~"41-300", er >=301~"301+" ),levels =c("0-40", "41-300", "301+")),# Add categorical variable for number of progesterone receptorspgr_cat =factor(case_when( pgr <21~"0-20", pgr >=21& pgr <201~"21-200", pgr >=201~"201+" ),levels =c("0-20", "21-200", "201+") ),# `rd` where 0= alive without relapse, 1= relapse or deathrd =case_when(recur==1|death==1~1, .default =0),# Calculate days to first of relapse, death or last follow-uprd_time =case_when(recur==1~ rtime, .default = dtime),# Add categorical age variableage_cat =factor(case_when( age <65~"24-64", age >=65~"65+")) )# Update the rotterdam_variables objectrotterdam_variables <-bind_rows( rotterdam_variables,tribble(~name, ~value,"rd", "0= alive without relapse, 1= relapse or death","rd_time", "days to relapse/death or last follow-up","age_cat", "age category at surgery"))
Visualise the distribution of numeric variables
Code
p1 <- rotterdam_cleaned |>ggplot()+geom_histogram(aes(x=age))+labs(x="age at surgery")p2 <- rotterdam_cleaned |>ggplot()+geom_bar(aes(x=size))+labs(x="tumor size")p3 <- rotterdam_cleaned |>ggplot()+geom_histogram(aes(x=nodes))+labs(x="number of positive lymph nodes")p4 <- rotterdam_cleaned |>ggplot()+geom_histogram(aes(x=pgr))+labs(x="progesterone receptors (fmol/l)")p5 <- rotterdam_cleaned |>ggplot()+geom_histogram(aes(x=er))+labs(x="estrogen receptors (fmol/l)")p6 <- rotterdam_cleaned |>ggplot()+geom_bar(aes(x=hormon))+labs(x="hormonal treatment")p7 <- rotterdam_cleaned |>ggplot()+geom_bar(aes(x=meno))+labs(x="menopausal status")p8 <- rotterdam_cleaned |>ggplot()+geom_histogram(aes(x=rd_time))+labs(x="days to relapse/death or last follow-up")p1+p2+p3+p4+p5+p6+p7+p8+plot_layout(ncol =2)
Analysis
Kaplan-Meier curve
Relapse-free survival probability at different timepoints:
Code
# Fit a Surv-type object for right-censored datarotterdam_surv_fit_rd <-# survival::survfit(Surv(rd_time, rd) ~ 1, data = rotterdam_cleaned)# Using survfit2 from the ggsurvfit package allows easier control for # downstream analysis using the ggsurvfit() function: ggsurvfit::survfit2(Surv(rd_time, rd) ~1, data = rotterdam_cleaned)# Print its summary at specific timesrotterdam_surv_fit_rd |>KM_table()
time
n.risk
n.event
n.censor
cum.event
cum.censor
estimate
conf.high
conf.low
std.error
1,000
928
608
10
608
10
0.61
0.63
0.58
0.0205
2,000
573
289
66
897
76
0.41
0.44
0.39
0.0307
3,000
321
105
147
1,002
223
0.33
0.35
0.30
0.0384
4,000
112
62
147
1,064
370
0.24
0.27
0.22
0.0547
5,000
24
14
74
1,078
444
0.19
0.22
0.15
0.0970
6,000
3
2
19
1,080
463
0.14
0.22
0.09
0.2196
7,000
1
0
2
1,080
465
0.14
0.22
0.09
0.2196
Restricted Mean Survival Time (RMST): Restricted Mean Survival Time (RMST) is a statistical measure used in survival analysis that offers a clinically meaningful way to compare survival outcomes between groups. It represents the average event-free survival time up to a pre-specified time point, and is defined as the area under the survival curve from time zero up to a specific, pre-defined time point. (using the survival package, I can only print the RMST for single groups or strata, but for more detailed comparisons between groups (e.g. getting the difference in RMST with confidence intervals) I could try more specialised packages, e.g. survRM2.
“The RMST may provide valuable information for comparing two survival curves when the proportional hazards assumption is not met, such as in cases of crossing or delayed separation of survival curves.”
- Han K, Jung I. Restricted Mean Survival Time for Survival Analysis: A Quick Guide for Clinical Researchers. Korean J Radiol. 2022 May;23(5):495-499. doi: 10.3348/kjr.2022.0061
For this analysis, I’ve set the restriction time at 5 years (1,825 days):
Restricted Mean Survival Time:Over 5 years (1825 days)
Code
# Print mean survival timeprint(rotterdam_surv_fit_meno, rmean =365*5, print.rmean =TRUE)
Call: survfit(formula = Surv(rd_time, rd) ~ meno, data = rotterdam_cleaned)
n events rmean* se(rmean) median 0.95LCL 0.95UCL
meno=premenopausal 628 396 1298 24.1 1749 1510 2141
meno=postmenopausal 918 684 1174 21.4 1275 1172 1442
* restricted mean with upper limit = 1825
Over 10 years (3650 days)
Code
# Print mean survival timeprint(rotterdam_surv_fit_meno, rmean =365*10, print.rmean =TRUE)
Call: survfit(formula = Surv(rd_time, rd) ~ meno, data = rotterdam_cleaned)
n events rmean* se(rmean) median 0.95LCL 0.95UCL
meno=premenopausal 628 396 2046 55.2 1749 1510 2141
meno=postmenopausal 918 684 1733 44.4 1275 1172 1442
* restricted mean with upper limit = 3650
KM curve:
Code
rotterdam_surv_fit_meno |>KM_plot()
Log-rank test
Use a log-rank test, which accounts for the whole follow-up period. By comparing the observed number of events in each group with the number that would be expected if event rates were the same, a chi-squared test is used to determine if any observed differences are statistically meaningful.
Code
survdiff(Surv(rd_time, rd) ~ meno, rho =0, data = rotterdam_cleaned)
Call:
survdiff(formula = Surv(rd_time, rd) ~ meno, data = rotterdam_cleaned,
rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
meno=premenopausal 628 396 479 14.4 26
meno=postmenopausal 918 684 601 11.5 26
Chisq= 26 on 1 degrees of freedom, p= 3e-07
Regression modelling (CoxPH)
Visually inspect individual KM curves for covariates
Numeric covariates are categorized here only to aid visualisation
chisq df p
age 4.05333 1 0.04408
meno 0.00248 1 0.96026
size 13.17300 2 0.00138
nodes 6.73579 1 0.00945
pgr 15.02219 1 0.00011
er 24.55857 1 7.2e-07
hormon 2.99857 1 0.08334
GLOBAL 61.45472 8 2.4e-10
The ‘GLOBAL’ test appears to show that the PH assumption is violated for the overall model, and also appears to be violated for the age, size, nodes, and er variables.
Visual inspection of the test:
Plots the scaled Schoenfeld residuals against transformed time for each covariate.
A non-horizontal line with a non-zero slope suggests a violation of the PH assumption.
While the results of the test of the proportional hazards assumption imply violation of the proportional hazards assumption, after visually inspecting the Schoenfeld residuals the validity of the assumption may still be argued to hold. Still, next steps may involve some combination of transformation of variables and/or consideration of including time-dependent interaction terms.
We could additionally further explore comparing estimates of Restricted Mean Survival Time (RMST), e.g. see:
Han K, Jung I. Restricted Mean Survival Time for Survival Analysis: A Quick Guide for Clinical Researchers. Korean J Radiol. 2022 May;23(5):495-499. doi: 10.3348/kjr.2022.0061
“We conclude that the hazard ratio cannot be recommended as a general measure of the treatment effect in a randomized controlled trial, nor is it always appropriate when designing a trial. Restricted mean survival time may provide a practical way forward and deserves greater attention.”
- Royston, P., Parmar, M.K. Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Med Res Methodol 13, 152 (2013). https://doi.org/10.1186/1471-2288-13-152
RMST
First show again the RMST estimates from the survival package for 5-year (1825 days) survival (no confidence intervals given for the RMST estimates):
Call: survfit(formula = Surv(rd_time, rd) ~ meno, data = rotterdam_cleaned)
n events rmean* se(rmean) median 0.95LCL 0.95UCL
meno=premenopausal 628 396 1298 24.1 1749 1510 2141
meno=postmenopausal 918 684 1174 21.4 1275 1172 1442
* restricted mean with upper limit = 1825
Calculate and compare 5-year (1825 days) survival in the menopausal status groups using the survRM2 package, which includes confidence intervals for the RMST estimates (arm = 0: premenopausal, arm = 1: postmenopausal)
Code
rmst_meno <- survRM2::rmst2(time = rotterdam_cleaned$rd_time, status = rotterdam_cleaned$rd, arm =as.numeric(rotterdam_cleaned$meno)-1,tau =1825) rmst_meno
The truncation time: tau = 1825 was specified.
Restricted Mean Survival Time (RMST) by arm
Est. se lower .95 upper .95
RMST (arm=1) 1174.471 21.379 1132.570 1216.373
RMST (arm=0) 1297.652 24.141 1250.337 1344.967
Restricted Mean Time Lost (RMTL) by arm
Est. se lower .95 upper .95
RMTL (arm=1) 650.529 21.379 608.627 692.430
RMTL (arm=0) 527.348 24.141 480.033 574.663
Between-group contrast
Est. lower .95 upper .95 p
RMST (arm=1)-(arm=0) -123.180 -186.382 -59.978 0
RMST (arm=1)/(arm=0) 0.905 0.860 0.952 0
RMTL (arm=1)/(arm=0) 1.234 1.105 1.378 0
From the above, the unadjusted RMST ratio of postmenopausal:premenopausal is 0.905 [95% CI: 0.860, 0.952], i.e. RMST is significantly lower among postmenopausal relative to premenopausal.
Model with covariates (survRM2 package)
Run the same model as above, but add in all of the covariates
Code
rmst_meno_c <- survRM2::rmst2(time = rotterdam_cleaned$rd_time, status = rotterdam_cleaned$rd, arm =as.numeric(rotterdam_cleaned$meno)-1,tau =1825,covariates =c(rotterdam_cleaned$age, rotterdam_cleaned$size, rotterdam_cleaned$nodes, rotterdam_cleaned$pgr, rotterdam_cleaned$er, rotterdam_cleaned$hormon)) rmst_meno_c
The truncation time: tau = 1825 was specified.
Summary of between-group contrast (adjusted for the covariates)
Est. lower .95 upper .95 p
RMST (arm=1)-(arm=0) -156.484 -167.008 -145.959 0
RMST (arm=1)/(arm=0) 0.880 0.874 0.887 0
RMTL (arm=1)/(arm=0) 1.302 1.279 1.325 0
Model summary (difference of RMST)
coef se(coef) z p lower .95 upper .95
intercept 1229.374 4.041 304.196 0 1221.453 1237.295
arm -156.484 5.370 -29.142 0 -167.008 -145.959
covariates 1.572 0.001 1608.053 0 1.571 1.574
Model summary (ratio of RMST)
coef se(coef) z p exp(coef) lower .95 upper .95
intercept 7.112 0.003 2348.937 0 1226.338 1219.083 1233.637
arm -0.127 0.004 -31.970 0 0.880 0.874 0.887
covariates 0.001 0.000 2175.876 0 1.001 1.001 1.001
Model summary (ratio of time-lost)
coef se(coef) z p exp(coef) lower .95 upper .95
intercept 6.379 0.007 870.916 0 589.075 580.679 597.592
arm 0.264 0.009 28.911 0 1.302 1.279 1.325
covariates -0.003 0.000 -123.062 0 0.997 0.997 0.997
From the above, the adjusted RMST ratio of postmenopausal:premenopausal is 0.880 [95% CI: 0.874, 0.887], i.e. RMST remains significantly lower among postmenopausal relative to premenopausal after adjustment.